Scalable parallel computing on clouds using Twister4Azure iterative MapReduce

Authors

  • Thilina Gunarathne
  • Bingjing Zhang
  • Tak-Lon Wu
  • Judy Qiu
Abstract

Recent advances in data-intensive computing for scientific discovery are fueling a dramatic growth in the use of data-intensive iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure and storage services, offers a very attractive environment in which scientists can perform data analytics. The challenges of large-scale distributed computation in cloud environments demand innovative computational frameworks that are specifically tailored to cloud characteristics, so that the power of clouds can be harnessed easily and effectively. Twister4Azure is a distributed, decentralized iterative MapReduce runtime for the Windows Azure Cloud. Twister4Azure extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling fault-tolerant execution of a wide array of data mining and data analysis applications on the Azure cloud. Twister4Azure uses the scalable, distributed and highly available Azure cloud services as its underlying building blocks, and employs a decentralized control architecture that avoids single points of failure. Twister4Azure optimizes iterative computations using multi-level data caching, cache-aware decentralized task scheduling, hybrid tree-based data broadcasting and hybrid intermediate data communication. This paper presents the Twister4Azure iterative MapReduce runtime and a study of four real-world data-intensive scientific applications implemented using Twister4Azure: two iterative applications, Multi-Dimensional Scaling and KMeans Clustering, and two pleasingly parallel applications, BLAST+ sequence searching and Smith-Waterman sequence alignment. Performance measurements show results comparable to, or a factor of 2 to 4 better than, those of traditional MapReduce runtimes, in deployments on up to 256 instances and for jobs with tens of thousands of tasks. We also study and present solutions to several factors that affect the performance of iterative MapReduce applications on the Windows Azure Cloud.

Keywords: Iterative MapReduce, Cloud Computing, HPC, Scientific applications, Azure
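
As a rough illustration of the iterative MapReduce pattern the abstract describes, the following single-process Python sketch runs KMeans Clustering as repeated map and reduce steps over input partitions that are loaded once and reused in every iteration. It is not the Twister4Azure API: the function names (`map_task`, `reduce_task`, `iterative_kmeans`), the in-memory `cache` list standing in for data caching, and the convergence tolerance are all assumptions made for this example.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_task(partition, centroids):
    """Map step (hypothetical): per-centroid partial sums and counts for one partition."""
    sums = {}
    for point in partition:
        k = nearest(point, centroids)
        acc, n = sums.get(k, ([0.0] * len(point), 0))
        sums[k] = ([a + p for a, p in zip(acc, point)], n + 1)
    return sums


def reduce_task(map_outputs, centroids):
    """Reduce step (hypothetical): merge partial sums into new centroids."""
    updated = []
    for k, old in enumerate(centroids):
        total, count = [0.0] * len(old), 0
        for out in map_outputs:
            if k in out:
                acc, n = out[k]
                total = [t + a for t, a in zip(total, acc)]
                count += n
        # Keep the old centroid if no point was assigned to it.
        updated.append([t / count for t in total] if count else list(old))
    return updated


def iterative_kmeans(partitions, centroids, max_iters=20, tol=1e-9):
    """Driver loop: static data is cached once; only centroids change per iteration."""
    cache = [list(p) for p in partitions]  # stand-in for cached static input data
    iteration = 0
    for iteration in range(1, max_iters + 1):
        outputs = [map_task(p, centroids) for p in cache]  # would run in parallel
        updated = reduce_task(outputs, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(u, c))
            for u, c in zip(updated, centroids)
        )
        centroids = updated  # small, loop-variant data sent to the next iteration
        if shift <= tol:
            break
    return centroids, iteration
```

In this sketch the large, unchanging input stays in `cache` across iterations while only the small centroid list is passed around, which is the general reason caching helps iterative workloads; how a real runtime schedules tasks and moves data is outside what this example shows.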


Similar resources

Scalable Parallel Scientific Computing Using Twister4Azure

Recent advances in data intensive computing for science discovery are fueling a dramatic growth in use of data-intensive iterative computations. The utility computing model introduced by cloud computing combined with the rich set of cloud infrastructure and storage services offers a very attractive environment for scientists to perform data analytics. The challenges to large-scale distributed c...

Iterative MapReduce for Azure Cloud

MapReduce distributed data processing architecture has become the de-facto data-intensive analysis mechanism in compute clouds and in commodity clusters, mainly due to its excellent fault tolerance features, scalability, ease of use and the simpler programming model. MapReduceRoles for Azure (MR4Azure) is a decentralized, dynamically scalable MapReduce runtime we developed for Windows Azure Clo...

Scalable programming and algorithms for data-intensive life science applications.

Cloud computing [1] offers new approaches for scientific computing that leverage the major commercial hardware and software investment in this area. Closely coupled applications are still unclear in clouds as synchronization costs are still higher than on optimized MPI machines. However loosely coupled problems are very important in many fields and can achieve good cloud performance even when p...

Data Mining Using Clouds: An Experimental Implementation of Apriori over MapReduce

Cloud computing has become a viable mainstream solution for data processing, storage and distribution. It promises on demand, scalable, pay-as-you-go compute and storage capacity. To analyze “big data” on clouds, it is very important to research data mining strategies based on cloud computing paradigm from both theoretical and practical views. For this purpose, we study a strategy of data minin...

Journal:
  • Future Generation Comp. Syst.

Volume 29

Publication date: 2013